DELOS: An Automatically Tagged Economic Corpus for Modern Greek

Authors

  • Katia Kermanidis
  • Nikos Fakotakis
  • George K. Kokkinakis
Abstract

Text corpora have become an essential resource for Natural Language Processing tasks in recent years. A wide range of applications, such as information retrieval and ontology and terminology extraction, require a corpus that is sufficiently large but restricted to a specific domain. Manual tagging of such a corpus is very costly, which motivates automatic annotation by a set of linguistic tools, itself a very challenging task. DELOS, described in this paper, is a Modern Greek corpus from the economic domain consisting of 5 million word tokens, automatically tagged for morphology and shallow syntactic relations. The annotation tools described are embodied in an integrated system, and they are applied to the corpus using the GATE text engineering platform. The system output is a textual database marked up with the annotation tagset, in plain text as well as in XML format.


Similar resources

Challenges in Extracting Terminology from Modern Greek Texts

This paper describes the automatic extraction of economic terminology from Modern Greek texts as a first step towards creating an ontological thesaurus of economic concepts. Unlike previous approaches, the domain-specific corpus utilized is varying in genre, and therefore rich in vocabulary and linguistic structure, while the pre-processing level is relatively low (basic morphological tagging, ...


Ensemble Learning of Economic Taxonomy Relations from Modern Greek Corpora

This paper proposes the use of ensemble learning for the identification of taxonomic relations between Modern Greek economic terms. Unlike previous approaches, apart from is-a and part-of relations, the present work deals also with relation types that are characteristic of the economic domain. Semantic and syntactic information governing the term pairs is encoded in a novel feature-vector repre...


Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...


Minimal Pairs and Functional Loads of Sound Contrasts Obtained from a List of Modern Greek Words

This paper reports on the initial results of our investigation into the distribution of speech sounds across the lexicon of Modern Greek (MG). The data we discuss ultimately derive from the list of orthographic word-types of a large general corpus of written MG. The orthographic word-types were automatically transcribed into their respective citation forms. Minimal pairs were automatically extr...


Learning Subcategorization Frames from Corpora: a Case Study for Modern Greek

Certain Natural Language Processing (NLP) applications such as parsing and semantic processing require complete lexicons that provide subcategorization information for a word of interest, i.e. the necessary information about the set(s) of syntactic constituents the word must combine with, in order for its meaning to be fully expressed. Modern Greek presents high flexibility in the allowable ord...




Publication date: 2002